Fast and accurate semi-supervised protein homology detection with large uncurated sequence databases

Authors

Pai-Hsi Huang

Pavel Kuksa

Vladimir Pavlovic

Abstract

Establishing structural and functional relationships between sequences when only primary sequence information is available is a key task in biological sequence analysis. This ability is critical for tasks such as inferring the superfamily membership of unannotated proteins (remote homology detection) when no secondary or tertiary structure is available. Recent methods such as profile kernels and mismatch neighborhood kernels have shown promising results by leveraging unlabeled data and explicitly modeling mutations, insertions, and deletions using mutational neighborhoods. However, the size of such a neighborhood exhibits an exponential dependency on the cardinality of the alphabet set, which makes kernel evaluation expensive and hinders the use of these powerful tools. Moreover, previous studies of large-scale semi-supervised protein homology detection lack a systematic and biologically motivated approach for leveraging the unlabeled data set. In this study, we propose such an approach for extracting relevant information from unlabeled sequence databases. We also propose a method to remove the bias caused by overrepresented sequences, which are common in unlabeled sequence databases. Combining these approaches with a class of kernels (sparse spatial sampling kernels, SSSK) that effectively model mutation, insertion, and deletion, we achieve fast and accurate semi-supervised protein homology detection on three large unlabeled databases. The resulting classifiers significantly outperform previously published state-of-the-art methods in accuracy and exhibit order-of-magnitude differences in experimental running time.
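The abstract does not spell out the SSSK feature map, so the following is only a minimal sketch of the general idea under stated assumptions: each feature is a pair of short substrings together with the gap between their start positions, and the kernel is the inner product of the resulting count vectors. The function names, the parameters `k` and `max_gap`, and the exact feature definition are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def spatial_sample_features(seq, k=1, max_gap=3):
    """Count (kmer_a, gap, kmer_b) features in seq.

    Assumed feature form (hypothetical, for illustration only):
    two k-mers whose start positions are `gap` apart, 1 <= gap <= max_gap.
    """
    feats = Counter()
    n = len(seq)
    for i in range(n - k + 1):
        a = seq[i:i + k]
        for gap in range(1, max_gap + 1):
            j = i + gap
            if j + k > n:  # second k-mer would run past the end
                break
            feats[(a, gap, seq[j:j + k])] += 1
    return feats


def spatial_sample_kernel(x, y, k=1, max_gap=3):
    """Inner product of the two feature-count vectors."""
    fx = spatial_sample_features(x, k, max_gap)
    fy = spatial_sample_features(y, k, max_gap)
    if len(fx) > len(fy):  # iterate over the smaller map
        fx, fy = fy, fx
    return sum(count * fy[feat] for feat, count in fx.items())


# A substitution in the middle residue still leaves the (A, 2, C) feature shared.
print(spatial_sample_kernel("ABC", "ABC", k=1, max_gap=2))
print(spatial_sample_kernel("ABC", "AXC", k=1, max_gap=2))
```

In this sketch, features are read directly off each sequence, so no mutational neighborhood is enumerated and the extraction cost grows with sequence length and `max_gap` rather than with the alphabet size.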


Similar resources

A fast, large-scale learning method for protein sequence classification

Motivation: Establishing structural and functional relationships between sequences in the presence of only the primary sequence information is a key task in biological sequence analysis. This ability can be critical for tasks such as making inferences of the structural class of unannotated proteins when no secondary or tertiary structure is available. Recent computational methods based on profi...


Expression analyses of endoglucanase gene in Penicillium oxalicum and Trichoderma viride

The expression of endoglucanase gene and protein profile belonging to two fungal species, Penicillium oxalicum 1SMS and Trichoderma viride 156MS with high cellulase enzyme activity, was investigated. Fungal isolates were cultured on inducer CMC medium and then the amounts of released sugar and protein were assayed every three days for a month, using arsenate molybdate reagent and Bradford method,...


Protein remote homology detection based on auto-cross covariance transformation

Protein remote homology detection is a critical step toward annotating a protein's structure and function. Supervised learning algorithms such as support vector machines are currently the most accurate methods. The position-specific score matrices (PSSMs) contain rich information about the evolutionary relationship of proteins. However, the PSSMs often have different lengths, which are difficult to b...


SVM-HUSTLE - an iterative semi-supervised machine learning approach for pairwise protein remote homology detection

MOTIVATION As the amount of biological sequence data continues to grow exponentially we face the increasing challenge of assigning function to this enormous molecular 'parts list'. The most popular approaches to this challenge make use of the simplifying assumption that similar functional molecules, or proteins, sometimes have similar composition, or sequence. However, these algorithms often fa...


Spatially-constrained sample kernel for sequence classification

Kernel-based learning methods provide some of the most accurate results in many sequence analysis and prediction tasks [1, 2, 4, 6]. However, the improved accuracy is often achieved at the cost of high computational complexity of training and prediction. We propose a new family of the string-based kernel classification methods for the sequence analysis tasks that offer low computational cost an...



Publication date: 2008
